Yet Another Language Identifier

Author

  • Martin Majlis
Abstract

Language identification of written text has been studied for several decades. Nevertheless, most research focuses on a few of the most widely spoken languages and ignores the minor ones. Identifying a larger number of languages introduces difficulties that do not arise with only a few languages, and these difficulties reduce accuracy. The objective of this paper is to investigate the sources of this degradation. To isolate the impact of individual factors, five different algorithms and three different numbers of languages are used. The Support Vector Machine algorithm achieved an accuracy of 98% for 90 languages, and the YALI algorithm, which is based on a scoring function, achieved an accuracy of 95.4%. The YALI algorithm is slightly less accurate, but it classifies around 17 times faster and trains more than 4000 times faster. To overcome the lack of standardized data sets, three data sets with varying numbers of languages and sample sizes were prepared. These data sets are now publicly available.
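
The abstract does not describe the YALI scoring function itself, so the following is only a minimal sketch of what a scoring-based identifier can look like in general, assuming character trigram frequencies and a summed log score. The function names (char_ngrams, train, identify), the floor value, and the toy training sentences are all hypothetical and are not taken from the paper.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build relative n-gram frequencies per language.

    samples maps a language code to a list of training strings (assumed format).
    """
    models = {}
    for lang, texts in samples.items():
        counts = Counter(g for t in texts for g in char_ngrams(t, n))
        total = sum(counts.values())
        models[lang] = {g: c / total for g, c in counts.items()}
    return models


def identify(text, models, n=3, floor=1e-6):
    """Return the language with the highest summed log score.

    Unseen n-grams receive the small constant `floor` (an assumed smoothing choice).
    """
    grams = char_ngrams(text, n)
    scores = {
        lang: sum(math.log(freqs.get(g, floor)) for g in grams)
        for lang, freqs in models.items()
    }
    return max(scores, key=scores.get)


# Toy training data, far too small for real use; illustration only.
toy = {
    "en": ["the quick brown fox jumps over the lazy dog", "this is the house"],
    "de": ["der schnelle braune fuchs springt", "das ist das haus und der hund"],
}
models = train(toy)
print(identify("the dog is in the house", models))
print(identify("der hund ist das", models))
```

Training here is a single counting pass over the data, which hints at why a scoring approach can train much faster than a Support Vector Machine, although the speed-ups reported in the abstract refer to the author's own implementation, not to this sketch.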

Related articles

Culture and Language Education

There are different views on the relationship between language and culture. Some consider them as separate entities one being a code-system and the other a system of beliefs and attitudes. Some believe in a cause and effect relationship between the two; and yet others argue for a co-evolutionary mode of interrelation. This paper will subscribe to the Hallidayan co-evolutionary view of the relat...

A Scheme for Integrating the Top-Level Branch of the Traditional Iranian Medicine Cluster into the Structure of the Unified Medical Language System (UMLS) Metathesaurus

Background & Aim: Unified Medical Language System (UMLS) is an extensive ontology of biomedical knowledge developed and maintained by U.S. National Library of Medicine (NLM). Traditional Iranian Medicine (TIM) does not have any position in the structure of metathesaurus of UMLS. The main aim of this study was designing a scheme of TIM cluster's crotch mapping in the structure of metathesaurus o...

Yet Another Application of the Theory of ODE in the Theory of Vector Fields

In this paper we define the θ-vector field on the n-surface S and then investigate the existence and uniqueness of its integral curves using the theory of ordinary differential equations. The subject is then illustrated through some examples.

Language identification on code-switching utterances using multiple cues

Code-switching speech is an utterance containing two or more languages. Usually, the switching occurs at the clause or word level. In this paper, a two-stage framework consisting of a language identifier followed by a speech recognizer is proposed for evaluating Mandarin-Taiwanese code-switching utterances. In the language identifier, we use multiple cues including acoustic, prosodic and p...

Covenant, Promise, and the Gift of Time

If we categorize religions according to whether they give greater prominence to time or to space, the role of “promise” marks a religion of covenant as clearly a religion of time. Yet the future is unknowable and can only be present to us as a field of possibilities. How far do these possibilities extend? The question directs us back to the nature of time, a question that became concealed in th...

Publication date: 2012